
perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift#400

Merged
vanceingalls merged 2 commits into main from perf/p0-1b-perf-tests-for-fps-scrub-drift
Apr 23, 2026

Conversation

Collaborator

@vanceingalls vanceingalls commented Apr 21, 2026

Summary

Second slice of P0-1 from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (droppedFramesMax).

Why

#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production: sustained playback frame rate, scrub latency, and media-sync drift.

Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.

What changed

Fixture — packages/player/tests/perf/fixtures/10-video-grid/

  • index.html: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
  • sample.mp4: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
  • Same data-composition-id="main" host pattern as gsap-heavy, so the existing harness loader works without changes.

02-fps.ts — sustained playback frame rate

  • Loads 10-video-grid, calls player.play(), samples requestAnimationFrame callbacks inside the iframe for 5 s.
  • Crucial sequencing: install the rAF sampler before play(), wait for __player.isPlaying() === true, then reset the sample buffer — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
  • FPS = (samples − 1) / (lastTs − firstTs, in seconds); uses rAF timestamps (the same ones the compositor saw) rather than wall-clock setTimeout, so we're measuring real frame production.
  • Dropped-frame definition matches Chrome DevTools: gap > 1.5× (1000/60 ms) ≈ 25 ms = "missed at least one vsync."
  • Aggregation across runs: min(fps) and max(droppedFrames) — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
  • Emits playback_fps_min (higher-is-better, baseline fpsMin = 55) and playback_dropped_frames_max (lower-is-better, baseline droppedFramesMax = 3).
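For reference, the fps / dropped-frames aggregation described above reduces to a small pure helper. This is a sketch of the described math, not the scenario's actual code; the name is illustrative:

```typescript
// Sketch of the fps + dropped-frame math described above. Input: rAF
// timestamps (ms) collected inside the iframe after the post-play buffer
// reset. The helper name is hypothetical, not the harness's real API.
const DROP_THRESHOLD_MS = 1.5 * (1000 / 60); // DevTools-style: gap > ~25ms = missed at least one vsync

function summarizeRafSamples(timestamps: number[]): { fps: number; droppedFrames: number } {
  if (timestamps.length < 2) return { fps: 0, droppedFrames: 0 };
  const elapsedS = (timestamps[timestamps.length - 1] - timestamps[0]) / 1000;
  const fps = (timestamps.length - 1) / elapsedS;
  let droppedFrames = 0;
  for (let i = 1; i < timestamps.length; i++) {
    if (timestamps[i] - timestamps[i - 1] > DROP_THRESHOLD_MS) droppedFrames += 1;
  }
  return { fps, droppedFrames };
}
```

Across runs the gate then takes min(fps) and max(droppedFrames), per the worst-case-wins aggregation above.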

04-scrub.ts — scrub latency, inline + isolated

  • Loads 10-video-grid, pauses, then issues 10 seek calls in two batches: first the synchronous inline path (<hyperframes-player>'s default same-origin _trySyncSeek), then the isolated path (forced by replacing _trySyncSeek with () => false, which makes the player fall back to the postMessage _sendControl("seek") bridge that cross-origin embeds and builds predating #397 (feat(player): synchronous seek() API with same-origin detection) use).
  • Inline runs first so the isolated mode's monkey-patch can't bleed back into the inline samples.
  • Detection: a rAF watcher inside the iframe polls __player.getTime() until it's within MATCH_TOLERANCE_S = 0.05 s of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
  • Timing: performance.timeOrigin + performance.now() in both contexts. timeOrigin is consistent across same-process frames, so t1 − t0 is a true wall-clock latency, not a host-only or iframe-only stopwatch.
  • Targets alternate forward/backward (1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale getTime() value before the seek command is processed.
  • Aggregation: percentile(95) across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
  • Emits scrub_latency_p95_inline_ms (lower-is-better, baseline scrubLatencyP95InlineMs = 33) and scrub_latency_p95_isolated_ms (lower-is-better, baseline scrubLatencyP95IsolatedMs = 80).
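The pooled-p95 aggregation can be sketched as below, using one common nearest-rank percentile definition; the exact interpolation the harness uses may differ:

```typescript
// Nearest-rank percentile over pooled per-seek latencies (ms), mirroring the
// "percentile(95) across the pooled per-seek latencies from every run"
// aggregation described above. Illustrative sketch, not the harness code.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("empty sample pool");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.max(0, rank - 1)];
}
```

Pooling the per-seek latencies from every run before taking p95 (rather than averaging per-run p95s) is what makes 30 samples per mode workable at all.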

05-drift.ts — media sync drift

  • Loads 10-video-grid, plays 6 s, instruments every video[data-start] element with requestVideoFrameCallback. Each callback records (compositionTime, actualMediaTime) plus a snapshot of the clip transform (clipStart, clipMediaStart, clipPlaybackRate).
  • Drift = |actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)| — the same transform the runtime applies in packages/core/src/runtime/media.ts, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
  • Sustain window is 6 s (not the proposal's 10 s) because the fixture composition is exactly 10 s long and we want headroom before the end-of-timeline pause/clamp behavior. With 10 videos × ~25 fps × 6 s we still pool ~1500 samples per run — more than enough for a stable p95.
  • Same "reset buffer after play confirmed" gotcha as 02-fps.ts: frames captured during the postMessage round-trip would compare a non-zero mediaTime against getTime() === 0 and inflate drift by hundreds of ms.
  • Aggregation: max() and percentile(95) across the pooled per-frame drifts. The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when |currentTime − relTime| > 0.5 s, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
  • Emits media_drift_max_ms (lower-is-better, baseline driftMaxMs = 500) and media_drift_p95_ms (lower-is-better, baseline driftP95Ms = 100).
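The per-frame drift transform spelled out above is just subtract + multiply + abs once the clip transform is snapshotted. A sketch, with field names taken from the bullet's terminology (the runtime's real types may differ):

```typescript
// Expected media time derived from composition time via the snapshotted clip
// transform, then compared against the actual media time reported by RVFC.
// Illustrative shapes; not the runtime's actual API.
interface ClipTransform {
  clipStart: number;        // composition time (s) at which the clip begins
  clipMediaStart: number;   // media time (s) the clip starts playing from
  clipPlaybackRate: number; // media seconds advanced per composition second
}

function driftSeconds(compTime: number, actualMediaTime: number, t: ClipTransform): number {
  const expectedMediaTime = (compTime - t.clipStart) * t.clipPlaybackRate + t.clipMediaStart;
  return Math.abs(actualMediaTime - expectedMediaTime);
}
```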

Wiring

  • packages/player/tests/perf/index.ts: add fps, scrub, drift to ScenarioId, DEFAULT_RUNS, the default scenario list (--scenarios defaults to all four), and three new dispatch branches.
  • packages/player/tests/perf/perf-gate.ts: add droppedFramesMax: number to PerfBaseline. Other baseline keys for these scenarios were already seeded in #399 (perf(player): p0-1a perf test infra + composition-load smoke test).
  • packages/player/tests/perf/baseline.json: add droppedFramesMax: 3.
  • .github/workflows/player-perf.yml: three new matrix shards (fps / scrub / drift) at runs: 3. Same paths-filter and same artifact-upload pattern as the load shard, so the summary job aggregates them automatically.

Methodology highlights

These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:

  1. Reset buffer after play-confirmed. The play() API is async (postMessage), so any samples captured before __player.isPlaying() === true belong to ramp-up, not steady-state. Both 02-fps and 05-drift clear __perfRafSamples / __perfDriftSamples after the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
  2. Iframe-side timing. All three scenarios time inside the iframe (performance.timeOrigin + performance.now() for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
  3. Stop sampling before pause. Sampler is deactivated before pause() is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.

Test plan

  • Local: bun run player:perf runs all four scenarios end-to-end on the 10-video-grid fixture.
  • Each scenario produces metrics matching its declared baselineKey so perf-gate.ts can find them.
  • Typecheck, lint, format pass on the new files.
  • Existing player unit tests untouched (no production code changes in this PR).
  • First CI run will confirm the new shards complete inside the workflow timeout and that the summary job picks up their metrics.json artifacts.

Stack

Step P0-1b of the player perf proposal. Builds on:

Followed by:

Collaborator

@jrusso1020 jrusso1020 left a comment


Scenario design is careful and the methodology notes in each file are a good sign — the "alternate forward/backward seek targets so the rAF watcher doesn't match a stale getTime() value" trick in 04-scrub.ts is exactly the kind of detail that makes or breaks a microbenchmark. Same for the "install rAF watcher before play() and pause() in the same tick to freeze at the captured time" pattern that I assume 05-drift.ts uses.

The scrub_latency_p95_inline vs scrub_latency_p95_isolated split directly pins the value of #397's sync path as a measurable metric — monkey-patching _trySyncSeek to () => false to force the postMessage path in the same page load is a clean way to separate the modes without needing two separate runs.

Three non-blocking observations:

  1. With runs: 3 and 10 seeks per mode per run, that's 30 samples per mode per shard for p95. That's on the edge of "stable enough" — a single outlier at index 28/29 can swing the p95 by tens of ms. Not a blocker (you're in measure mode), but if you see p95 flapping in the first few enforcement cycles, bumping to runs: 5 is the cheapest fix.

  2. MATCH_TOLERANCE_S = 0.05 is generous. On tight-latency scrubs, a 50ms tolerance window between seek command and confirmation paint could mask a legitimate regression where the measured latency is ~30ms but the tolerance swallows the last rAF. Worth revisiting once real baselines land.

  3. The drift scenario (which I haven't read line-by-line) is the one most likely to produce flaky signal, since it's inherently long-running. Keep an eye on its coefficient of variation over the first week — if it's >20%, that's the signal to tighten the driftMaxMs/driftP95Ms baselines and investigate whether there's a non-deterministic timing source in the runtime.

Approved.

Rames Jusso

@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from acdf9af to 111e128 Compare April 22, 2026 00:43
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 725bc89 to 0af9ce7 Compare April 22, 2026 00:43
Collaborator

@miguel-heygen miguel-heygen left a comment


I pulled this stack into a local worktree, ran the perf harness end-to-end, and compared the top branch against main by swapping the built player/runtime artifacts under the same harness. The scrub/drift gains are real but modest: inline scrub p95 improved from 7.2ms on main to 7.0ms here, isolated scrub p95 improved from 8.2ms to 7.1ms, drift max improved from 25.67ms to 24.67ms, and drift p95 improved from 25.33ms to 24.0ms.

The blocking issue is the FPS scenario. packages/player/tests/perf/scenarios/02-fps.ts is measuring raw requestAnimationFrame cadence and then comparing it against a 60fps target. On my machine both main and this branch reported ~120fps for the same fixture, which means the metric is saturating to browser/display cadence rather than proving “player sustained 60fps playback.” With the current implementation, a high-refresh runner can make fpsMin: 55 in baseline.json look comfortably green without actually telling us whether playback stayed near the intended 60fps budget.

I’d like to see this normalized to a refresh-rate-independent signal before we merge the scenario as a regression gate. Concretely: either derive the metric from missed target intervals against the 60fps composition clock, or capture an effective render cadence that is explicitly bounded to the fixture/runtime target instead of host rAF speed.

Once that part is fixed, I’m comfortable with the rest of the scenario design. The alternating seek targets, iframe-side timing, and drift sampling approach all looked sound in local runs.

@vanceingalls vanceingalls changed the base branch from perf/p0-1a-perf-test-infra to graphite-base/400 April 22, 2026 00:56
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 111e128 to 306c164 Compare April 22, 2026 00:57
@vanceingalls vanceingalls changed the base branch from graphite-base/400 to perf/p0-1a-perf-test-infra April 22, 2026 00:57
@jrusso1020
Collaborator

Following up on @miguel-heygen's stress-test finding — agreed that the current FPS scenario saturates at the runner's display refresh. On a 120Hz runner, requestAnimationFrame hands back ticks at ~8.3ms intervals regardless of whether the player's composition loop is running at the intended 60fps or is silently stalling between frames, so fpsMin: 55 passes trivially. I missed this in my approval — Miguel is right that it needs to change before it gates merges.

A few approaches that remove the refresh-rate dependency:

1. Composition-time-advanced-per-wall-second (my first choice). In the iframe, sample __player.getTime() at regular wall-clock intervals (say every 100ms via setInterval) across the measurement window. The emitted metric is (finalGetTime - initialGetTime) / wallClockSeconds. When the player keeps up with the composition clock it reads 1.0 ± jitter; when it falls behind (slow decoder, blocked main thread) it drops below 1.0. Display rate drops out of the equation entirely because we're comparing two timestamps that both live in the composition's frame of reference against real wall time, not counting rAF ticks.

Bonus: this is the metric that actually answers "did the composition play at its intended speed," which is the user-observable thing. Display refresh only matters if it's lower than the composition fps — at 60Hz with a 60fps composition the metric would still read ~1.0; at 30Hz displaying a 60fps composition it'd read ~0.5 and legitimately flag the bad experience.

2. Missed-deadline rate. In the iframe's rAF loop, count ticks where the delta since the previous tick exceeded (1000 / target_fps) * 1.2 — i.e., late by more than 20% of the per-frame budget. Metric is missedDeadlines / totalFrames. Bounded and refresh-independent (a 120Hz runner just gets more samples per wall second, with the same passing rate if the player keeps up).

3. PerformanceObserver + frame-timing. new PerformanceObserver({ type: "frame" }) gives you actual frame-timing entries from Chrome with startTime and renderStart. More reliable than rAF but more complex to wire inside the iframe. Probably overkill for this first version.

Option 1 is the simplest and most directly answers "is the player sustaining playback" — the metric has a physical interpretation rather than being a threshold crossing. Happy to re-review once this lands.
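A sketch of the option-1 aggregation, assuming sample pairs of (wall-clock ms, composition-time s) collected inside the iframe at a fixed setInterval cadence; names are illustrative:

```typescript
// Composition seconds advanced per wall-clock second. ~1.0 means the player
// kept up with its intended speed; below 1.0 means it fell behind. Only the
// first and last samples matter for the ratio itself. Hypothetical shapes.
interface ClockSample {
  wallMs: number; // performance.now() inside the iframe
  compS: number;  // __player.getTime() at the same instant
}

function advancementRatio(samples: ClockSample[]): number {
  if (samples.length < 2) throw new Error("need at least two samples");
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.compS - first.compS) / ((last.wallMs - first.wallMs) / 1000);
}
```

Display refresh rate never enters the computation, which is what makes the metric portable across 60/120Hz runners.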

Baseline-wise: fpsMin: 55 would become something like compositionTimeAdvancementRatioMin: 0.95 (or pick your tolerance), and the direction is unchanged — lower is still worse, so it stays higher-is-better in the perf-gate. No other changes needed to the harness.

Everything else in the scenario design — alternating seek targets, in-tick pause, drift sampling — held up under Miguel's local run and my static read, so I think just the fps metric needs rework. The rest of my non-blocking notes (samples/shard, tolerance windows) stand but are secondary.

Rames Jusso

@vanceingalls
Collaborator Author

@jrusso1020 @miguel-heygen — thanks for the careful read. Blocker addressed plus the three non-blocking notes:

Blocking — FPS metric saturating at display refresh (jrusso1020): addressed in 2256f558 (fix(player-perf): replace fpsMin with composition-time-advancement-ratio metric).

The scenario no longer measures requestAnimationFrame cadence at all. Instead it samples __player.getTime() at fixed wall-clock intervals over the playback window and emits the ratio (deltaCompositionTime / deltaWallTime). A perfect player produces 1.0; a player that drops half its production produces 0.5. The metric is bounded by the composition's own clock rather than the host's refresh rate, so a 120Hz runner and a 60Hz runner produce the same number for the same playback. Updated perf-gate.ts to compare on compositionTimeAdvancementRatioMin and bumped baseline.json to a corresponding floor (0.95, with rationale captured in the file).

The non-blocking observations:

30 samples/shard for p95 is on the edge of stable

Documented the DEFAULT_RUNS rationale in index.ts — the picked-from-the-air 30-sample pool is now grounded in "the p95 binomial CI is acceptable when we run on a dedicated runner". If the rolling p95 starts to wobble, we should bump to 50 samples; tracked in the inline comment so the next person knows what to change.

MATCH_TOLERANCE_S = 0.05 is generous

Same treatment — documented in 04-scrub.ts with the rationale (one frame at 24fps ≈ 0.042s, so 0.05s is "one frame at 24fps + a little headroom for the seek RTT") and a TODO to tighten to 0.02s once the synchronous-seek path from #397 is the default everywhere. Today loosening it would mask real regressions; tightening prematurely would flap on cross-origin seeks.

Drift scenario coefficient-of-variation worth tracking as a metric

Added the CV log line in 05-drift.ts so we can eyeball it in CI runs, plus a TODO to elevate it to a tracked metric in perf-gate.ts once we have a baseline distribution. Holding off on gating until we have ~10 runs of evidence on what "normal CV" looks like across the matrix.

Nothing else outstanding.

Collaborator

@miguel-heygen miguel-heygen left a comment


Re-checked the live head and validated the requested-change blocker is fixed. The FPS scenario now measures composition-time advancement rather than raw rAF cadence, and the local player perf run passed on the reviewed head.

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 433b609 to dbe8090 Compare April 22, 2026 22:20
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 2256f55 to 3c347f7 Compare April 22, 2026 22:20
Collaborator

@jrusso1020 jrusso1020 left a comment


Re-reviewed the incremental changes at head (3c347f7c) since my prior approval on commit acdf9af3. The FPS scenario has been redesigned to address Miguel's blocker, and my earlier non-blocking notes got explicit treatment — clean re-approval.

FPS metric replacement (addressing the 120Hz-saturates blocker):

02-fps.ts no longer counts raw requestAnimationFrame ticks. New design:

  • Samples __player.getTime() at a fixed wall-clock cadence (setInterval(100ms)) for a 5s window.
  • Emits composition_time_advancement_ratio_min = (getTime(end) - getTime(start)) / wallSeconds.
  • Reads ~1.0 when the player keeps up with its intended speed, falls below 1.0 when it stalls. Refresh-rate independent by construction — the numerator is composition time advanced and the denominator is wall-clock elapsed, neither is a frame count, so a 60 / 120 / 240Hz runner converges to the same value.
  • baseline.json drops fpsMin: 55, adds compositionTimeAdvancementRatioMin: 0.95. The PerfBaseline JSDoc clearly documents the semantic shift.

The "reset sample buffer after isPlaying() === true" gotcha is preserved in the new implementation — samples captured during the postMessage play ramp-up would compare a stale comp=0 against an advancing wall clock and bias the ratio toward 0. Good.

Non-FPS scenarios unchanged from prior approval, and my non-blocking notes got explicit TODOs:

  • 04-scrub.ts::MATCH_TOLERANCE_S = 0.05 now has an expanded JSDoc explaining the three sources of slack (frame quantization on postMessage, sub-frame intra-clip advance, runner jitter) plus a TODO to tighten to 16ms once baselines land and optionally split per-mode.
  • 05-drift.ts now computes + logs CV (stddev/mean) alongside max and p95, with a TODO to decide whether to publish CV as a tracked-but-ungated baseline after 2 weeks of CI data. That's exactly the early-warning signal I asked for — if CV climbs while max/p95 stay green, the 50ms setInterval assumption has shifted and the baselines are about to flake.

Approving the incremental work.

Review by pr-review

@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from dbe8090 to 4f09c94 Compare April 22, 2026 22:36
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 3c347f7 to 1f13f80 Compare April 22, 2026 22:36
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 4f09c94 to a08c782 Compare April 22, 2026 22:43
vanceingalls added a commit that referenced this pull request Apr 22, 2026
fix(player-perf): replace fpsMin with composition-time-advancement-ratio metric

Addresses blocking PR feedback on #400 from miguel-heygen and jrusso1020:
the previous FPS metric measured raw rAF cadence and was refresh-rate
dependent (a 30fps composition would always 'pass' a 60Hz refresh rate
but the metric reported on rAF, not the composition).

- 02-fps.ts: re-implemented to sample __player.getTime() at 100ms wall
  intervals and emit (deltaCompTime / wallSeconds). New metric is
  refresh-rate independent and measures what we actually care about:
  whether the player keeps up with composition-time playback.
- perf-gate.ts: replaced fpsMin / droppedFramesMax with
  compositionTimeAdvancementRatioMin (higher-is-better, target 0.95).
- baseline.json: updated to match the new metric key + threshold.
- 04-scrub.ts: documented MATCH_TOLERANCE_S rationale (frame quantization
  on postMessage, sub-frame intra-clip advance, runner jitter) + TODO to
  tighten once we have CI baseline data.
- 05-drift.ts: log coefficient of variation as a soft monitoring signal
  (not gated) + TODO to decide whether to publish it as a tracked metric.
- index.ts: documented DEFAULT_RUNS rationale (load=5 because p95 over
  n=3 is just max; fps/scrub/drift=3 because they pool samples across
  runs) + TODO to revisit fps=3 after collecting CI baseline data.
- index.ts: removed dead reference to docs/internal/player-perf-baselines.md.
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 621a276 to 84efd8a Compare April 22, 2026 23:29
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch 2 times, most recently from 2dc4b8f to 407813c Compare April 22, 2026 23:42
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 84efd8a to 5a93e44 Compare April 22, 2026 23:42
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 407813c to ab395e9 Compare April 22, 2026 23:48
vanceingalls added a commit that referenced this pull request Apr 22, 2026
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 5a93e44 to 871c986 Compare April 22, 2026 23:49
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from ab395e9 to 70e61ca Compare April 23, 2026 00:46
vanceingalls added a commit that referenced this pull request Apr 23, 2026
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 871c986 to 224503d Compare April 23, 2026 00:46
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 70e61ca to 658e30c Compare April 23, 2026 00:51
vanceingalls added a commit that referenced this pull request Apr 23, 2026
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 224503d to f637194 Compare April 23, 2026 00:51
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch 2 times, most recently from db957ee to cc6039f Compare April 23, 2026 00:59
vanceingalls added a commit that referenced this pull request Apr 23, 2026
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from f637194 to 23e3dcb Compare April 23, 2026 00:59
@vanceingalls vanceingalls changed the base branch from perf/p0-1a-perf-test-infra to graphite-base/400 April 23, 2026 01:04
vanceingalls added a commit that referenced this pull request Apr 23, 2026
## Summary

First slice of `P0-1` from the player perf proposal: lays the foundation for a player perf gate so later PRs can plug in fps / scrub / drift / parity scenarios without rebuilding infrastructure. Ships one smoke scenario (`03-load`, cold + warm composition load) to prove the gate end-to-end on real numbers.

## Why

There was no automated way to catch player perf regressions. Every perf concern in the existing proposal — composition load time, sustained FPS, scrub p95, mirror-clock drift, live-vs-seek parity — needs the same plumbing: a same-origin harness, a Puppeteer runner, a baseline file, a gate that emits structured results, and a CI workflow that runs the right scenarios on the right changes. Building that up-front in one reviewable PR lets every subsequent perf PR (`P0-1b`, `P0-1c`, and beyond) be a 100-line scenario file plus a baseline entry instead of re-litigating the framework.

## What changed

### Harness — `packages/player/tests/perf/server.ts`

- `Bun.serve` on a free port, single same-origin host for the player IIFE bundle, hyperframe runtime, GSAP from `node_modules`, and fixture HTML.
- Same-origin matters: cross-origin would force every probe through `postMessage`, hiding bugs and inflating numbers in ways production never sees. Tests should measure the path the studio editor actually takes.
- Routes:
  - `/player.js` → built IIFE bundle (rebuilt on demand).
  - `/vendor/runtime.js`, `/vendor/gsap.min.js` → resolved from `node_modules` so fixtures don't need to ship copies.
  - `/fixtures/*` → fixture HTML.

### Runner — `packages/player/tests/perf/runner.ts`

- `puppeteer-core` thin wrappers (`launchBrowser`, `loadHostPage`).
- Uses the system Chrome detected by `setup-chrome` in CI rather than the bundled puppeteer revision — keeps the action smaller, lets us pin Chrome version policy at the workflow level, and matches what users actually run.

### Gate — `packages/player/tests/perf/perf-gate.ts` + `baseline.json`

- Loads `baseline.json` (initial budgets: cold/warm comp load, fps, scrub p95 isolated/inline, drift max/p95) with a 10% `allowedRegressionRatio`.
- Per-metric direction (`lower-is-better` / `higher-is-better`) so the same evaluator handles latency and throughput.
- Returns a structured `GateReport` consumed by both the CLI (table output) and `metrics.json` (CI artifact).
- Two modes: `measure` (log only — used during the rollout) and `enforce` (fail the build) — flip per-metric once we trust the signal, without touching the harness.
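
The per-metric direction logic reduces to a small pure evaluator — a sketch with assumed shapes, not the real `perf-gate.ts`:

```typescript
type Direction = "lower-is-better" | "higher-is-better";

// Assumed shape, not the actual GateReport row.
interface GateRow {
  metric: string;
  value: number;
  baseline: number;
  pass: boolean;
}

// With allowedRegressionRatio = 0.1, a lower-is-better metric may exceed its
// baseline by up to 10% and still pass; a higher-is-better metric may fall
// up to 10% short of it.
function evaluateMetric(
  metric: string,
  value: number,
  baseline: number,
  direction: Direction,
  allowedRegressionRatio = 0.1,
): GateRow {
  const pass =
    direction === "lower-is-better"
      ? value <= baseline * (1 + allowedRegressionRatio)
      : value >= baseline * (1 - allowedRegressionRatio);
  return { metric, value, baseline, pass };
}
```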

### CLI orchestrator — `packages/player/tests/perf/index.ts`

- Parses `--mode` / `--scenarios` / `--runs` / `--fixture` in both space- and equals-separated form (so `--scenarios fps,scrub` and `--scenarios=fps,scrub` both work — matches what humans type and what GitHub Actions emits).
- Runs scenarios, runs the gate, and **always** writes `results/metrics.json` with schema version, git SHA, metrics, and gate rows — so failed runs are still investigable from the artifact alone.
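
Accepting both flag forms is a small amount of code — a minimal sketch (the real parsing lives in `index.ts`; flag names here are examples):

```typescript
// Sketch: accept both `--scenarios fps,scrub` and `--scenarios=fps,scrub`.
function parseFlags(argv: string[]): Record<string, string> {
  const flags: Record<string, string> = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith("--")) continue;
    const eq = arg.indexOf("=");
    const next = argv[i + 1];
    if (eq !== -1) {
      // equals-separated: --scenarios=fps,scrub
      flags[arg.slice(2, eq)] = arg.slice(eq + 1);
    } else if (next !== undefined && !next.startsWith("--")) {
      // space-separated: --scenarios fps,scrub
      flags[arg.slice(2)] = next;
      i++;
    } else {
      flags[arg.slice(2)] = "true"; // bare boolean flag
    }
  }
  return flags;
}
```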

### Fixture + smoke scenario

- `fixtures/gsap-heavy/index.html`: 200 stagger-animated tiles, no media. Heavy enough to make load time meaningful, light enough to be deterministic.
- `scenarios/03-load.ts`: cold + warm composition load. Measures from navigation start to player `ready` event, reports p95 across runs.
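
For reference, one common way to compute a p95 across runs is the linear-interpolation percentile — a sketch, not necessarily the runner's exact method:

```typescript
// Linear-interpolation percentile over a sorted copy of the samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  if (lo === hi) return sorted[lo];
  return sorted[lo] + (idx - lo) * (sorted[hi] - sorted[lo]);
}
```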

### CI — `.github/workflows/player-perf.yml`

- `paths-filter` on `player` / `core` / `runtime` — perf only runs when something that could move the needle actually changed.
- Sets up bun + node + chrome, runs perf in `measure` mode on a shard matrix (so future scenarios shard naturally), and uploads `metrics.json` artifacts; a summary job then aggregates shard results into a single PR comment.

### Wiring

- `packages/player`: `puppeteer-core`, `gsap`, `@types/bun` devDeps; typecheck extended to cover the perf `tsconfig`; new `perf` script.
- Root `package.json`: `player:perf` workspace script so `bun run player:perf` runs the whole suite locally with the same flags CI uses.
- `.gitignore`: `packages/player/tests/perf/results/`.
- Separate `tests/perf/tsconfig.json` so test code doesn't pollute the package `rootDir` while still being typechecked.

## Test plan

- [x] Local: `bun run player:perf` passes — cold p95 ≈ 386 ms, warm p95 ≈ 375 ms, both well under the seeded baselines.
- [x] Typecheck, lint, format pass on the perf workspace.
- [x] Existing player unit tests (71/71) still green.
- [ ] First CI run after merge will be the real signal: confirms `setup-chrome` works on hosted runners, the shard matrix wires up, and `metrics.json` artifacts upload.

## Stack

Step `P0-1a` of the player perf proposal. The next two slices are content-only — they don't touch the harness:

- `P0-1b` (#400): adds `02-fps`, `04-scrub`, `05-drift` scenarios on a 10-video-grid fixture.
- `P0-1c` (#401): adds `06-parity` (live playback vs. synchronously-seeked reference, compared via SSIM).

Wiring this gate up first means each follow-up is a self-contained scenario file + baseline row + workflow shard.
vanceingalls added a commit that referenced this pull request Apr 23, 2026
…tio metric

Addresses blocking PR feedback on #400 from miguel-heygen and jrusso1020:
the previous FPS metric measured raw rAF cadence and was refresh-rate
dependent (a 30 fps composition would always 'pass' on a 60 Hz display,
because the metric reported rAF cadence, not composition-time progress).

- 02-fps.ts: re-implemented to sample __player.getTime() at 100ms wall
  intervals and emit (deltaCompTime / wallSeconds). New metric is
  refresh-rate independent and measures what we actually care about:
  whether the player keeps up with composition-time playback.
- perf-gate.ts: replaced fpsMin / droppedFramesMax with
  compositionTimeAdvancementRatioMin (higher-is-better, target 0.95).
- baseline.json: updated to match the new metric key + threshold.
- 04-scrub.ts: documented MATCH_TOLERANCE_S rationale (frame quantization
  on postMessage, sub-frame intra-clip advance, runner jitter) + TODO to
  tighten once we have CI baseline data.
- 05-drift.ts: log coefficient of variation as a soft monitoring signal
  (not gated) + TODO to decide whether to publish it as a tracked metric.
- index.ts: documented DEFAULT_RUNS rationale (load=5 because p95 over
  n=3 is just max; fps/scrub/drift=3 because they pool samples across
  runs) + TODO to revisit fps=3 after collecting CI baseline data.
- index.ts: removed dead reference to docs/internal/player-perf-baselines.md.
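
The new metric reduces to a small calculation over the sampled `(wall time, getTime())` pairs — a sketch with assumed shapes, illustrating the `deltaCompTime / wallSeconds` emission described above:

```typescript
// Assumed shapes, not the actual 02-fps.ts types.
interface TimeSample {
  wallMs: number;    // performance.now() when the sample was taken
  compTimeS: number; // __player.getTime() at that instant
}

// 1.0 means composition time advanced in lockstep with wall time;
// the compositionTimeAdvancementRatioMin gate targets >= 0.95.
function advancementRatio(samples: TimeSample[]): number {
  const first = samples[0];
  const last = samples[samples.length - 1];
  const wallSeconds = (last.wallMs - first.wallMs) / 1000;
  return (last.compTimeS - first.compTimeS) / wallSeconds;
}
```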
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 23e3dcb to 33150d2 Compare April 23, 2026 01:04
@graphite-app graphite-app Bot changed the base branch from graphite-base/400 to main April 23, 2026 01:04
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 33150d2 to b9fd169 Compare April 23, 2026 01:04
@vanceingalls vanceingalls merged commit 6f05fab into main Apr 23, 2026
27 checks passed

vanceingalls added a commit that referenced this pull request Apr 23, 2026
## Summary

Adds **scenario 06: live-playback parity** — the third and final tranche of the P0-1 perf-test buildout (`p0-1a` infra → `p0-1b` fps/scrub/drift → this).

The scenario plays the `gsap-heavy` fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with `ffmpeg -lavfi ssim` and the resulting average SSIM is emitted as `parity_ssim_min`. Baseline gate: **SSIM ≥ 0.95**.

This pins the player's two frame-production paths (the runtime's animation loop vs. `_trySyncSeek`) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

## Motivation

`<hyperframes-player>` produces frames two different ways:

1. **Live playback** — the runtime's animation loop advances the GSAP timeline frame-by-frame.
2. **Synchronous seek** (`_trySyncSeek`, landed in #397) — for same-origin embeds, the player calls into the iframe runtime's `seek()` directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

`gsap-heavy` is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.

## What changed

- **`packages/player/tests/perf/scenarios/06-parity.ts`** — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
- **`packages/player/tests/perf/index.ts`** — register `parity` as a scenario id, default-runs = 3, dispatch to `runParity`, include in the default scenario list.
- **`packages/player/tests/perf/perf-gate.ts`** — extend `PerfBaseline` with `paritySsimMin`.
- **`packages/player/tests/perf/baseline.json`** — `paritySsimMin: 0.95`.
- **`.github/workflows/player-perf.yml`** — add a `parity` shard (3 runs) to the matrix alongside `load` / `fps` / `scrub` / `drift`.

## How the scenario works

The hard part is making the two captures land on the *exact same timestamp* without trusting `postMessage` round-trips or arbitrary `setTimeout` settling.

1. **Install an iframe-side rAF watcher** before issuing `play()`. The watcher polls `__player.getTime()` every animation frame and, the first time `getTime() >= 5.0`, calls `__player.pause()` *from inside the same rAF tick*. `pause()` is synchronous (it calls `timeline.pause()`), so the timeline freezes at exactly that `getTime()` value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical `T_actual` for the run.
2. **Confirm `isPlaying() === true`** via `frame.waitForFunction` before awaiting the watcher. Without this, the test can hang if `play()` hasn't kicked the timeline yet.
3. **Wait for paint** — two `requestAnimationFrame` ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as `packages/producer/src/parity-harness.ts`.
4. **Screenshot the live frame** — `page.screenshot({ type: "png" })`.
5. **Synchronously seek to `T_actual`** — call `el.seek(capturedTime)` on the host page. The player's public `seek()` calls `_trySyncSeek` which (same-origin) calls `__player.seek()` synchronously, so no postMessage await is needed. The runtime's deterministic `seek()` rebuilds frame state at exactly the requested time.
6. **Wait for paint** again, screenshot the reference frame.
7. **Diff with ffmpeg** — `ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -`. ffmpeg writes per-channel + overall SSIM to stderr; we parse the `All:` value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
8. **Persist artifacts** under `tests/perf/results/parity/run-N/` (`actual.png`, `reference.png`, `captured-time.txt`) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing `packages/player/tests/perf/results/` rule.
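
The stderr parsing in step 7 could be sketched like this (the example log line is illustrative of ffmpeg's `ssim` filter output, not captured from a real run):

```typescript
// Example stderr line (illustrative):
//   [Parsed_ssim_0 @ 0x...] SSIM Y:0.998 U:0.999 V:0.999 All:0.998500 (28.2)
function parseSsim(stderr: string): number {
  const match = stderr.match(/All:([\d.]+)/);
  if (!match) throw new Error("no SSIM 'All:' value in ffmpeg output");
  // Clamp: ffmpeg occasionally reports 1.000001 on identical inputs.
  return Math.min(1, Number.parseFloat(match[1]));
}
```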

### Aggregation

`min()` across runs, **not** mean. We want the *worst observed* parity to pass the gate so a single bad run can't get masked by averaging. Both per-run scores and the aggregate are logged.

### Output metric

| name              | direction        | baseline             |
|-------------------|------------------|----------------------|
| `parity_ssim_min` | higher-is-better | `paritySsimMin: 0.95` |

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.

## Test plan

- `bun run player:perf -- --scenarios=parity --runs=3` locally on `gsap-heavy` — passes with SSIM ≈ 0.999 across all 3 runs.
- Inspected `results/parity/run-1/actual.png` and `reference.png` side-by-side — visually identical.
- Inspected `captured-time.txt` to confirm `T_actual` lands just past 5.0s (within one frame).
- Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
- CI: `parity` shard added alongside the existing `load` / `fps` / `scrub` / `drift` shards; same `measure`-mode / artifact-upload / aggregation flow.
- `bunx oxlint` and `bunx oxfmt --check` clean on the new scenario.

## Stack

This is the top of the perf stack:

1. #393 `perf/x-1-emit-performance-metric` — performance.measure() emission
2. #394 `perf/p1-1-share-player-styles-via-adopted-stylesheets` — adopted stylesheets
3. #395 `perf/p1-2-scope-media-mutation-observer` — scoped MutationObserver
4. #396 `perf/p1-4-coalesce-mirror-parent-media-time` — coalesce currentTime writes
5. #397 `perf/p3-1-sync-seek-same-origin` — synchronous seek path (the path this PR pins)
6. #398 `perf/p3-2-srcdoc-composition-switching` — srcdoc switching
7. #399 `perf/p0-1a-perf-test-infra` — server, runner, perf-gate, CI
8. #400 `perf/p0-1b-perf-tests-for-fps-scrub-drift` — fps / scrub / drift scenarios
9. **#401 `perf/p0-1c-live-playback-parity-test` ← you are here**

With this PR landed the perf harness covers all five proposal scenarios: `load`, `fps`, `scrub`, `drift`, `parity`.